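Beyond Words: Understanding Tokenization and the Lollipop Test

The Structure Beneath Language

Large language models (LLMs) do not "read" text the way humans do. Where we see letters and words, the model sees tokens: chunks of text represented as numbers. Understanding this abstraction is the first step toward mastering prompt engineering and system design.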

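The Lollipop Test

Why does an LLM struggle to reverse the letters of "lollipop", yet readily reverse "l-o-l-l-i-p-o-p"?

The problem: The model does not see an ordinary word letter by letter. It sees a single token for the whole word, or at most a few multi-character tokens, so it has no explicit "map" of where each letter sits inside a token.
The solution: Separating the letters with hyphens pushes the tokenizer to encode each letter as its own token. This gives the model the fine-grained "vision" it needs to carry out the task.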
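
As a small illustration of the workaround above, the sketch below builds the hyphen-separated form of a word and computes the reversal you would expect back, which is handy for checking a model's answer. The helper names `spell_out` and `reverse_spelled` are hypothetical, and no tokenizer or model is called here.

```python
def spell_out(word: str, sep: str = "-") -> str:
    """Return the word with a separator between every character."""
    return sep.join(word)


def reverse_spelled(spelled: str, sep: str = "-") -> str:
    """Reverse a separator-delimited spelling and join the letters into one word."""
    return "".join(reversed(spelled.split(sep)))


spelled = spell_out("lollipop")
print(spelled)                   # l-o-l-l-i-p-o-p
print(reverse_spelled(spelled))  # popillol
```

The spelled-out string is what you would place in the prompt; the reversed string is the reference answer to compare against.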

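Core Principles

Token ratio: As a rule of thumb, one token corresponds to roughly 4 characters of English text, or about 0.75 words.
Context window: A model has a fixed "memory" size (for example, 4,096 tokens). This limit covers both your prompt and the model's response.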
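
The two principles above can be combined into a rough budget check. This is a minimal sketch that assumes the 4-characters-per-token rule of thumb and the 4,096-token example window; `estimate_tokens` and `fits_context` are hypothetical helper names, and a real tokenizer would give different counts.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not an exact figure


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(prompt: str, max_response_tokens: int, context_window: int = 4096) -> bool:
    """Check that the prompt plus the reserved response budget stays within the window."""
    return estimate_tokens(prompt) + max_response_tokens <= context_window


prompt = "x" * 12_000  # stand-in for a 12,000-character prompt
print(estimate_tokens(prompt))      # 3000
print(fits_context(prompt, 1_000))  # True  (3000 + 1000 <= 4096)
print(fits_context(prompt, 1_500))  # False (3000 + 1500 > 4096)
```

Reserving the response budget up front reflects the point that the window is shared between the prompt and the output.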

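Base Models vs. Instruction-Tuned Models

Base LLM: Predicts the most likely next word based on a massive training dataset. For example, given "What is the capital of France?", it may continue with "What is the capital of Germany?" rather than answer the question.
Instruction-Tuned LLM: Fine-tuned with reinforcement learning from human feedback (RLHF) so that it follows specific instructions and behaves as an assistant.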

Question 1

If you are processing a document that is 3,000 English characters long, roughly how many tokens will the model consume?

A) 3,000 tokens

B) 750 tokens

C) 12,000 tokens

Question 2

Why is an Instruction-Tuned LLM preferred over a Base LLM for building a chatbot?

A) It is faster at generating text.

B) It uses fewer tokens.

C) It is trained to follow specific tasks and dialogue formats.

Challenge: Token Estimation

Apply the token ratio rule to a real-world scenario.

You are designing an automated summarization system. The system receives daily reports that average 10,000 characters in length.

Your API provider charges $0.002 per 1,000 tokens.

Step 1

Estimate the number of tokens for a single daily report.

Solution:
Using the rule of thumb (1 token ≈ 4 characters):
$$ \text{Tokens} = \frac{10{,}000}{4} = 2{,}500 \text{ tokens} $$

Step 2

Calculate the estimated cost to process one daily report.

Solution:
The cost is $0.002 per 1,000 tokens.
$$ \text{Cost} = \left( \frac{2{,}500}{1{,}000} \right) \times 0.002 = 2.5 \times 0.002 = \$0.005 $$
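
Both steps of the challenge can be reproduced in a few lines of code. This is a sketch under the challenge's stated assumptions (4 characters per token, $0.002 per 1,000 tokens); `estimate_cost` is a hypothetical helper name, and actual billing depends on the provider's real tokenizer.

```python
CHARS_PER_TOKEN = 4          # rule of thumb from the lesson
PRICE_PER_1K_TOKENS = 0.002  # USD, as given in the challenge


def estimate_cost(num_chars: int) -> tuple[float, float]:
    """Return (estimated tokens, estimated cost in USD) for a document."""
    tokens = num_chars / CHARS_PER_TOKEN
    cost = tokens / 1_000 * PRICE_PER_1K_TOKENS
    return tokens, cost


tokens, cost = estimate_cost(10_000)
print(f"{tokens:.0f} tokens, ${cost:.3f}")  # 2500 tokens, $0.005
```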